Methods
Exploratory Data Analysis (EDA)
Inspect Data
head(london, 5)
## realSum room_type room_shared room_private person_capacity
## 1 121.1223 Private room False True 2
## 2 195.9124 Private room False True 2
## 3 193.3253 Private room False True 3
## 4 180.3899 Private room False True 2
## 5 405.7010 Entire home/apt False False 3
## host_is_superhost multi biz cleanliness_rating guest_satisfaction_overall
## 1 False 0 0 6 69
## 2 False 1 0 10 96
## 3 False 1 0 10 95
## 4 False 1 0 9 87
## 5 False 0 1 7 65
## bedrooms dist metro_dist attr_index attr_index_norm rest_index
## 1 1 5.734117 0.4370940 222.8822 15.49341 470.0885
## 2 1 4.788905 1.4640505 235.3858 16.36259 530.1335
## 3 1 4.596677 0.4503062 268.9138 18.69325 548.9876
## 4 1 2.054769 0.1326705 472.3813 32.83707 1021.2711
## 5 0 4.491277 0.3541075 318.4915 22.13958 692.7754
## rest_index_norm lng lat
## 1 8.413765 -0.04975 51.52570
## 2 9.488466 -0.08475 51.54210
## 3 9.825922 -0.14585 51.54802
## 4 18.278973 -0.10611 51.52108
## 5 12.399473 -0.18797 51.49399
The structure of the dataset
str(london)
## 'data.frame': 5379 obs. of 19 variables:
## $ realSum : num 121 196 193 180 406 ...
## $ room_type : chr "Private room" "Private room" "Private room" "Private room" ...
## $ room_shared : chr "False" "False" "False" "False" ...
## $ room_private : chr "True" "True" "True" "True" ...
## $ person_capacity : num 2 2 3 2 3 2 2 2 4 2 ...
## $ host_is_superhost : chr "False" "False" "False" "False" ...
## $ multi : int 0 1 1 1 0 0 0 0 0 1 ...
## $ biz : int 0 0 0 0 1 1 1 1 1 0 ...
## $ cleanliness_rating : num 6 10 10 9 7 9 10 9 9 10 ...
## $ guest_satisfaction_overall: num 69 96 95 87 65 93 97 88 87 97 ...
## $ bedrooms : int 1 1 1 1 0 0 1 1 1 1 ...
## $ dist : num 5.73 4.79 4.6 2.05 4.49 ...
## $ metro_dist : num 0.437 1.464 0.45 0.133 0.354 ...
## $ attr_index : num 223 235 269 472 318 ...
## $ attr_index_norm : num 15.5 16.4 18.7 32.8 22.1 ...
## $ rest_index : num 470 530 549 1021 693 ...
## $ rest_index_norm : num 8.41 9.49 9.83 18.28 12.4 ...
## $ lng : num -0.0498 -0.0848 -0.1459 -0.1061 -0.188 ...
## $ lat : num 51.5 51.5 51.5 51.5 51.5 ...
The summary statistics
summary(london)
## realSum room_type room_shared room_private
## Min. : 54.33 Length:5379 Length:5379 Length:5379
## 1st Qu.: 174.51 Class :character Class :character Class :character
## Median : 268.12 Mode :character Mode :character Mode :character
## Mean : 364.39
## 3rd Qu.: 438.27
## Max. :12937.27
## person_capacity host_is_superhost multi biz
## Min. :2.000 Length:5379 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.000 Class :character 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.000 Mode :character Median :0.0000 Median :0.0000
## Mean :2.858 Mean :0.2798 Mean :0.3579
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :6.000 Max. :1.0000 Max. :1.0000
## cleanliness_rating guest_satisfaction_overall bedrooms
## Min. : 2.000 Min. : 20.00 Min. :0.000
## 1st Qu.: 9.000 1st Qu.: 87.00 1st Qu.:1.000
## Median :10.000 Median : 94.00 Median :1.000
## Mean : 9.194 Mean : 90.92 Mean :1.133
## 3rd Qu.:10.000 3rd Qu.: 99.00 3rd Qu.:1.000
## Max. :10.000 Max. :100.00 Max. :8.000
## dist metro_dist attr_index attr_index_norm
## Min. : 0.04056 Min. :0.01388 Min. : 68.74 Min. : 4.778
## 1st Qu.: 3.54568 1st Qu.:0.32404 1st Qu.: 177.22 1st Qu.: 12.320
## Median : 4.93914 Median :0.53613 Median : 247.65 Median : 17.215
## Mean : 5.32762 Mean :1.01653 Mean : 294.58 Mean : 20.477
## 3rd Qu.: 6.83807 3rd Qu.:1.09076 3rd Qu.: 361.07 3rd Qu.: 25.099
## Max. :17.32120 Max. :9.17409 Max. :1438.56 Max. :100.000
## rest_index rest_index_norm lng lat
## Min. : 140.5 Min. : 2.515 Min. :-0.25170 Min. :51.41
## 1st Qu.: 382.1 1st Qu.: 6.839 1st Qu.:-0.16996 1st Qu.:51.49
## Median : 527.3 Median : 9.439 Median :-0.11813 Median :51.51
## Mean : 625.6 Mean : 11.197 Mean :-0.11478 Mean :51.50
## 3rd Qu.: 764.2 3rd Qu.: 13.678 3rd Qu.:-0.06772 3rd Qu.:51.53
## Max. :5587.1 Max. :100.000 Max. : 0.12018 Max. :51.58
Handle Missing Data
colSums(is.na(london))
## realSum room_type
## 0 0
## room_shared room_private
## 0 0
## person_capacity host_is_superhost
## 0 0
## multi biz
## 0 0
## cleanliness_rating guest_satisfaction_overall
## 0 0
## bedrooms dist
## 0 0
## metro_dist attr_index
## 0 0
## attr_index_norm rest_index
## 0 0
## rest_index_norm lng
## 0 0
## lat
## 0
Numerical/Categorical variables
target = 'realSum'
numerical_vars = c('dist', 'metro_dist', 'attr_index', 'attr_index_norm', 'rest_index', 'rest_index_norm', 'lng', 'lat', 'cleanliness_rating', 'guest_satisfaction_overall') #sapply(london, is.numeric)
categorical_vars = names(london)[!names(london) %in% union(numerical_vars, target)]
Identify Outliers
# Interquartile Range (IQR) and z-score methods
get_outliers_idx = function(data, method = "iqr", threshold = 1) {
  if (method == "iqr") {
    Q1 = quantile(data, 0.25)
    Q3 = quantile(data, 0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    idx = which(data < lower_bound | data > upper_bound)
  } else if (method == "z-score") {
    z_scores = abs(scale(data))
    idx = which(z_scores > threshold)  # flag points beyond the threshold
  } else {
    stop("method not found")
  }
  return(idx)
}
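As a quick sanity check of the IQR rule on toy data (not part of the report's dataset; the IQR branch is repeated inline so the snippet runs on its own):

```r
# Standalone sketch of the IQR rule: bounds are Q1 - 1.5*IQR and Q3 + 1.5*IQR
get_outliers_idx = function(data) {
  Q1 = quantile(data, 0.25)
  Q3 = quantile(data, 0.75)
  iqr = Q3 - Q1
  which(data < Q1 - 1.5 * iqr | data > Q3 + 1.5 * iqr)
}

x = c(1:10, 100)       # ten small values plus one extreme point
get_outliers_idx(x)    # only position 11 (the value 100) falls outside the bounds
```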
Remove outliers
idx = c()
for (col in c(numerical_vars, target)) {
  idx = union(idx, get_outliers_idx(london[[col]]))
}
london = london[-idx, ]
Data Visualization: numerical variables
# par(mfrow = c(1, length(numerical_vars)))
for (col in numerical_vars) {
  hist(london[[col]], main = col, xlab = "Value", ylab = "Frequency", col = "steelblue")
}
(Figures: one histogram per numerical variable.)
The relationship between numerical variables
Pairplots
pairs(london[numerical_vars], col = "steelblue")

Correlation
round(cor(london[numerical_vars]), 2)
## dist metro_dist attr_index attr_index_norm
## dist 1.00 0.38 -0.90 -0.90
## metro_dist 0.38 1.00 -0.41 -0.41
## attr_index -0.90 -0.41 1.00 1.00
## attr_index_norm -0.90 -0.41 1.00 1.00
## rest_index -0.89 -0.46 0.93 0.93
## rest_index_norm -0.89 -0.46 0.93 0.93
## lng 0.02 0.27 0.05 0.05
## lat -0.27 -0.17 0.17 0.17
## cleanliness_rating 0.09 0.11 -0.09 -0.09
## guest_satisfaction_overall 0.13 0.18 -0.15 -0.15
## rest_index rest_index_norm lng lat
## dist -0.89 -0.89 0.02 -0.27
## metro_dist -0.46 -0.46 0.27 -0.17
## attr_index 0.93 0.93 0.05 0.17
## attr_index_norm 0.93 0.93 0.05 0.17
## rest_index 1.00 1.00 -0.02 0.26
## rest_index_norm 1.00 1.00 -0.02 0.26
## lng -0.02 -0.02 1.00 0.27
## lat 0.26 0.26 0.27 1.00
## cleanliness_rating -0.10 -0.10 -0.04 -0.06
## guest_satisfaction_overall -0.16 -0.16 -0.02 -0.04
## cleanliness_rating guest_satisfaction_overall
## dist 0.09 0.13
## metro_dist 0.11 0.18
## attr_index -0.09 -0.15
## attr_index_norm -0.09 -0.15
## rest_index -0.10 -0.16
## rest_index_norm -0.10 -0.16
## lng -0.04 -0.02
## lat -0.06 -0.04
## cleanliness_rating 1.00 0.67
## guest_satisfaction_overall 0.67 1.00
library(corrplot)
## corrplot 0.92 loaded
cor_mx = cor(london[numerical_vars])
corrplot(cor_mx, method = "color")

Data Visualization: Categorical variables
for (col in categorical_vars) {
  boxplot(realSum ~ london[[col]], data = london,
          main = paste0("realSum vs ", col),
          xlab = col,
          ylab = "Real Sum",
          col = "steelblue")
}
(Figures: one boxplot of realSum against each categorical variable.)
Unique values
for (cat_var in categorical_vars) {
  print(unique(london[[cat_var]]))
}
## [1] "Private room" "Entire home/apt" "Shared room"
## [1] "False" "True"
## [1] "True" "False"
## [1] 2 3 4 6 5
## [1] "False" "True"
## [1] 1 0
## [1] 0 1
## [1] 1 0 2 3 5 4
Transform to factor variables
london[categorical_vars] = lapply(london[categorical_vars], as.factor)
Encoding categorical variables
london$room_shared = ifelse(london$room_shared == "True", 1, 0)
london$room_private = ifelse(london$room_private == "True", 1, 0)
london$host_is_superhost = ifelse(london$host_is_superhost == "True", 1, 0)
london$room_type = as.numeric(london$room_type)
Exclude collinear predictors
cor(london$attr_index, london$attr_index_norm)
## [1] 1
cor(london$rest_index, london$rest_index_norm)
## [1] 1
london = subset(london, select = -c(attr_index, rest_index, room_shared, room_private))
The distribution of the target variable
num_bins = 15
bin_width = (max(london$realSum) - min(london$realSum)) / num_bins
breakpoints = seq(min(london$realSum), max(london$realSum), by = bin_width)
hist(london$realSum[london$realSum < max(london$realSum)],
main = "Distribution of realSum",
xlab = "realSum",
ylab = "Frequency",
breaks = breakpoints,
col = "steelblue")

Main Effects Analysis
Split data into train/test
set.seed(42)  # seed value arbitrary, added so the split is reproducible
idx = sample(1:nrow(london), floor(0.8 * nrow(london)))
train_data = london[idx, ]
test_data = london[-idx, ]
target = "realSum"
X_train = train_data[, !(names(train_data) %in% target)]
y_train = train_data[[target]]
X_test = test_data[, !(names(test_data) %in% target)]
y_test = test_data[[target]]
Function to calculate RMSE
calc_rmse = function(actual_values, predicted_values) {
  rmse = sqrt(mean((actual_values - predicted_values) ^ 2))
  return(rmse)
}
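A quick check of the formula on a toy vector (the function is redefined inline so the snippet is standalone): with errors of 1, 0, and -2, the RMSE should be sqrt(5/3), about 1.291.

```r
calc_rmse = function(actual_values, predicted_values) {
  sqrt(mean((actual_values - predicted_values) ^ 2))
}

# errors are 1, 0, -2, so RMSE = sqrt((1 + 0 + 4) / 3)
calc_rmse(c(3, 4, 5), c(2, 4, 7))
```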
Performs diagnostic calculations for linear models
diag = function(model, name) {  # note: this masks base::diag for the rest of the analysis
  resid = fitted(model) - london$realSum
  rmse = sqrt(sum(resid^2) / nrow(london))
  r2 = summary(model)$r.squared
  coefs = nrow(summary(model)$coeff)
  data.frame(model = name, rmse = rmse, r2 = r2, coefficients = coefs)
}
Fit a full additive model
full_additive_mod = lm(realSum ~ ., data = london)
summary(full_additive_mod)
##
## Call:
## lm(formula = realSum ~ ., data = london)
##
## Residuals:
## Min 1Q Median 3Q Max
## -295.34 -58.97 -12.87 40.07 633.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2079.4323 3025.2963 0.687 0.491904
## room_type -156.8349 3.8600 -40.631 < 2e-16 ***
## person_capacity3 22.6491 5.4863 4.128 3.73e-05 ***
## person_capacity4 62.1017 4.9795 12.472 < 2e-16 ***
## person_capacity5 70.2959 9.7317 7.223 6.06e-13 ***
## person_capacity6 121.6284 8.8126 13.802 < 2e-16 ***
## host_is_superhost 4.8053 4.1339 1.162 0.245140
## multi1 8.8460 3.8212 2.315 0.020667 *
## biz1 19.7699 4.1121 4.808 1.58e-06 ***
## cleanliness_rating 23.1461 2.7877 8.303 < 2e-16 ***
## guest_satisfaction_overall 0.7166 0.2979 2.406 0.016186 *
## bedrooms1 53.7163 6.0696 8.850 < 2e-16 ***
## bedrooms2 155.9215 8.1215 19.199 < 2e-16 ***
## bedrooms3 182.8555 15.0762 12.129 < 2e-16 ***
## bedrooms4 110.3803 36.0825 3.059 0.002235 **
## bedrooms5 61.4314 93.7029 0.656 0.512121
## dist 6.7243 1.9839 3.389 0.000707 ***
## metro_dist -12.2260 3.6522 -3.348 0.000823 ***
## attr_index_norm 1.4684 0.6252 2.349 0.018892 *
## rest_index_norm 9.9394 1.1618 8.555 < 2e-16 ***
## lng -156.2512 29.0213 -5.384 7.71e-08 ***
## lat -40.4128 58.6179 -0.689 0.490595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.43 on 3920 degrees of freedom
## Multiple R-squared: 0.6757, Adjusted R-squared: 0.674
## F-statistic: 389 on 21 and 3920 DF, p-value: < 2.2e-16
linear_exp_results = diag(full_additive_mod, "full_additive_mod")
linear_exp_results
## model rmse r2 coefficients
## 1 full_additive_mod 93.16959 0.675745 22
Fitting an additive model with no interactions to the data yields an
\(R^2 = 0.67574\). We will attempt to
improve on this.
We will use AIC to determine if any of the predictors can be
dropped.
aic_full_additive_mod = step(full_additive_mod, direction = "backward", trace = FALSE)
summary(aic_full_additive_mod)
##
## Call:
## lm(formula = realSum ~ room_type + person_capacity + multi +
## biz + cleanliness_rating + guest_satisfaction_overall + bedrooms +
## dist + metro_dist + attr_index_norm + rest_index_norm + lng,
## data = london)
##
## Residuals:
## Min 1Q Median 3Q Max
## -295.58 -58.80 -13.08 40.54 634.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.4669 31.9898 -0.390 0.696768
## room_type -156.4758 3.8493 -40.651 < 2e-16 ***
## person_capacity3 22.2907 5.4796 4.068 4.84e-05 ***
## person_capacity4 61.8693 4.9763 12.433 < 2e-16 ***
## person_capacity5 69.9986 9.7284 7.195 7.43e-13 ***
## person_capacity6 121.4798 8.8110 13.787 < 2e-16 ***
## multi1 9.2028 3.8062 2.418 0.015659 *
## biz1 19.7525 4.1108 4.805 1.61e-06 ***
## cleanliness_rating 23.5866 2.7669 8.525 < 2e-16 ***
## guest_satisfaction_overall 0.7458 0.2959 2.521 0.011756 *
## bedrooms1 53.7896 6.0692 8.863 < 2e-16 ***
## bedrooms2 156.0994 8.1202 19.224 < 2e-16 ***
## bedrooms3 182.9747 15.0753 12.137 < 2e-16 ***
## bedrooms4 109.9049 36.0759 3.046 0.002331 **
## bedrooms5 60.1752 93.6959 0.642 0.520755
## dist 7.0379 1.9240 3.658 0.000258 ***
## metro_dist -11.7688 3.5824 -3.285 0.001028 **
## attr_index_norm 1.6054 0.5854 2.743 0.006125 **
## rest_index_norm 9.7581 1.1283 8.649 < 2e-16 ***
## lng -164.8004 26.5292 -6.212 5.78e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.43 on 3922 degrees of freedom
## Multiple R-squared: 0.6756, Adjusted R-squared: 0.674
## F-statistic: 429.9 on 19 and 3922 DF, p-value: < 2.2e-16
row = diag(aic_full_additive_mod, "aic_full_additive_mod")
row
## model rmse r2 coefficients
## 1 aic_full_additive_mod 93.19119 0.6755946 20
linear_exp_results = rbind(linear_exp_results, row)
AIC eliminates just host_is_superhost and lat. The \(R^2\) of the model left after AIC elimination, 0.6755946, is almost the same as the full additive model's 0.67574, and the \(RMSE\) is about the same.
We will now try BIC to see if this results in a smaller model.
bic_full_additive_mod = step(full_additive_mod, direction = "backward", k = log(nrow(london)), trace = FALSE)
summary(bic_full_additive_mod)
##
## Call:
## lm(formula = realSum ~ room_type + person_capacity + biz + cleanliness_rating +
## bedrooms + metro_dist + rest_index_norm + lng, data = london)
##
## Residuals:
## Min 1Q Median 3Q Max
## -298.66 -59.69 -12.90 40.73 627.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.397 23.232 3.805 0.000144 ***
## room_type -154.393 3.809 -40.538 < 2e-16 ***
## person_capacity3 22.383 5.474 4.089 4.42e-05 ***
## person_capacity4 62.129 4.974 12.490 < 2e-16 ***
## person_capacity5 70.469 9.731 7.242 5.31e-13 ***
## person_capacity6 122.391 8.792 13.920 < 2e-16 ***
## biz1 13.563 3.513 3.861 0.000115 ***
## cleanliness_rating 27.906 2.172 12.847 < 2e-16 ***
## bedrooms1 52.790 6.056 8.717 < 2e-16 ***
## bedrooms2 154.977 8.105 19.121 < 2e-16 ***
## bedrooms3 182.615 15.063 12.123 < 2e-16 ***
## bedrooms4 105.746 36.140 2.926 0.003453 **
## bedrooms5 58.434 93.930 0.622 0.533908
## metro_dist -12.342 3.575 -3.453 0.000561 ***
## rest_index_norm 9.665 0.447 21.625 < 2e-16 ***
## lng -150.363 26.000 -5.783 7.90e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.69 on 3926 degrees of freedom
## Multiple R-squared: 0.6734, Adjusted R-squared: 0.6722
## F-statistic: 539.8 on 15 and 3926 DF, p-value: < 2.2e-16
row = diag(bic_full_additive_mod, "bic_full_additive_model")
row
## model rmse r2 coefficients
## 1 bic_full_additive_model 93.50005 0.6734407 16
linear_exp_results = rbind(linear_exp_results, row)
BIC, as we would expect, produces a smaller model.
As there is little difference between these models, we will continue
the analysis with the full linear model. In summary, no one simple
additive model stands out.
Finding the Best Additive Model Using Brute Force
We performed a brute-force exploration of all linear additive models, selecting the model below by lowest RMSE. The code is not shown as it takes quite some time to run.
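A minimal sketch of what such an exhaustive search might look like, on synthetic data so it runs standalone (the report's actual search code is not shown; note that the RMSE here is measured on held-out rows, since training RMSE always favors the largest model):

```r
# Exhaustive search over all predictor subsets, scored by held-out RMSE
set.seed(1)
n = 200
df = data.frame(y = rnorm(n), a = rnorm(n), b = rnorm(n), c = rnorm(n))
train = df[1:150, ]
test  = df[151:200, ]
predictors = c("a", "b", "c")

best = list(rmse = Inf, formula = NULL)
for (k in seq_along(predictors)) {
  for (combo in combn(predictors, k, simplify = FALSE)) {
    f = reformulate(combo, response = "y")
    fit = lm(f, data = train)
    rmse = sqrt(mean((test$y - predict(fit, newdata = test))^2))
    if (rmse < best$rmse) best = list(rmse = rmse, formula = f)
  }
}
best$formula   # the subset with the lowest held-out RMSE
```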
brute_force = lm(realSum ~ room_type + person_capacity + multi + biz + bedrooms + metro_dist + rest_index_norm + lat, data = london)
row = diag(brute_force, "brute_force")
row
## model rmse r2 coefficients
## 1 brute_force 95.70173 0.6578804 16
linear_exp_results = rbind(linear_exp_results, row)
Collinearity Analysis
Although collinearity likely will not affect prediction, it will
affect our ability to perform inference tests.
Let us check for collinearity using the variance inflation
factor.
Calculate VIF
library(faraway)
vif(full_additive_mod)
## room_type person_capacity3
## 1.712194 1.113846
## person_capacity4 person_capacity5
## 1.649836 1.231677
## person_capacity6 host_is_superhost
## 1.792595 1.127032
## multi1 biz1
## 1.335385 1.738527
## cleanliness_rating guest_satisfaction_overall
## 1.839646 2.065436
## bedrooms1 bedrooms2
## 2.830461 3.179536
## bedrooms3 bedrooms4
## 1.462706 1.042173
## bedrooms5 dist
## 1.005581 6.482776
## metro_dist attr_index_norm
## 1.490631 10.692268
## rest_index_norm lng
## 9.227965 1.406944
## lat
## 1.397655
vif(full_additive_mod)[unname(vif(full_additive_mod)) > 5]
## dist attr_index_norm rest_index_norm
## 6.482776 10.692268 9.227965
Per the textbook, predictors with a VIF of more than 5 are suspicious
for collinearity. The variables with VIF values more than 5 include
dist, attr_index_norm, and rest_index_norm. This makes sense as dist is
the distance to the city center, and it is likely that if this is small,
the number of attractions and restaurants (as measured by
attr_index_norm and rest_index_norm) would be high. There are more
attractions and restaurants near the city center. Also, metro_dist, the
distance to a metro station, would likely be small if dist is small.
These values are not tremendously larger than 5, so we will keep them in the model.
Transformation Analysis
What does the distribution of the room prices look like?
hist(london$realSum,
main = "Distribution of realSum",
xlab = "realSum", col = "steelblue")

Maybe this would look better if we took the logarithm. This is a
typical transformation for prices that can vary over several orders of
magnitude.
hist(log(london$realSum), prob = TRUE,
     main = "Distribution of log(realSum)",
     xlab = "log(realSum)", col = "steelblue")
curve(dnorm(x, mean = mean(log(london$realSum)), sd = sd(log(london$realSum))),
      col = "darkorange", add = TRUE, lwd = 3)

This does look much better. The values are more spread out, and
approximate a normal distribution. Let us try to fit a model with the
log transformation of the response.
Function to perform diagnostics on functions with log
transformation of response
log_diag = function(model, name) {
  resid = exp(fitted(model)) - london$realSum
  rmse = sqrt(sum(resid^2) / nrow(london))
  r2 = summary(model)$r.squared
  coefs = nrow(summary(model)$coeff)
  data.frame(model = name, rmse = rmse, r2 = r2, coefficients = coefs)
}
Fit a model with log transformation of response and all available
predictors
full_log_model = lm(log(realSum) ~ ., data = london)
summary(full_log_model)
##
## Call:
## lm(formula = log(realSum) ~ ., data = london)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.86979 -0.19445 -0.02429 0.16259 1.61758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1417549 9.2756335 0.339 0.73485
## room_type -0.5453170 0.0118349 -46.077 < 2e-16 ***
## person_capacity3 0.0989368 0.0168212 5.882 4.40e-09 ***
## person_capacity4 0.2042732 0.0152671 13.380 < 2e-16 ***
## person_capacity5 0.2409053 0.0298376 8.074 8.98e-16 ***
## person_capacity6 0.3037975 0.0270196 11.244 < 2e-16 ***
## host_is_superhost 0.0272539 0.0126747 2.150 0.03159 *
## multi1 0.0302205 0.0117160 2.579 0.00993 **
## biz1 0.0411765 0.0126078 3.266 0.00110 **
## cleanliness_rating 0.0813609 0.0085470 9.519 < 2e-16 ***
## guest_satisfaction_overall 0.0018751 0.0009132 2.053 0.04011 *
## bedrooms1 0.1227469 0.0186095 6.596 4.79e-11 ***
## bedrooms2 0.3496422 0.0249007 14.041 < 2e-16 ***
## bedrooms3 0.4276488 0.0462242 9.252 < 2e-16 ***
## bedrooms4 0.3303286 0.1106298 2.986 0.00285 **
## bedrooms5 0.2309309 0.2872954 0.804 0.42156
## dist 0.0145217 0.0060826 2.387 0.01702 *
## metro_dist -0.0496261 0.0111976 -4.432 9.60e-06 ***
## attr_index_norm 0.0048881 0.0019170 2.550 0.01081 *
## rest_index_norm 0.0294681 0.0035620 8.273 < 2e-16 ***
## lng -0.6087828 0.0889799 -6.842 9.04e-12 ***
## lat 0.0314020 0.1797240 0.175 0.86131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2865 on 3920 degrees of freedom
## Multiple R-squared: 0.6889, Adjusted R-squared: 0.6873
## F-statistic: 413.4 on 21 and 3920 DF, p-value: < 2.2e-16
transform_results = log_diag(full_log_model, "full_log_model")
transform_results
## model rmse r2 coefficients
## 1 full_log_model 93.69018 0.6889451 22
The log transformation gives a better \(R^2\). The \(R^2\) value increased from around 0.6757450 with the linear model using all predictors to 0.6889451 with the log transform. The RMSE on the original scale is essentially unchanged (in fact slightly higher: 93.69 vs. 93.17).
Does the Box-Cox transformation work better than log transformation
of the response?
Inverse of Box-Cox to display results
invBoxCox = function(x, lambda) {
  if (lambda == 0) exp(x) else (lambda * x + 1)^(1 / lambda)
}
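A round-trip check that invBoxCox really inverts the Box-Cox transform (both functions are defined inline so the snippet runs on its own; the sample values and lambdas are arbitrary):

```r
# Forward Box-Cox transform and its inverse
boxCox_fwd = function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
invBoxCox = function(x, lambda) {
  if (lambda == 0) exp(x) else (lambda * x + 1)^(1 / lambda)
}

x = c(54.33, 268.12, 833.04)   # arbitrary price-scale values
for (lam in c(0, 0.0202, 0.5)) {
  stopifnot(isTRUE(all.equal(invBoxCox(boxCox_fwd(x, lam), lam), x)))
}
```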
Perform Box-Cox transform
library(MASS)
bc = boxcox(full_additive_mod)

(lambda = bc$x[which.max(bc$y)])
## [1] 0.02020202
bc_model = lm(((realSum^lambda-1)/lambda) ~ ., data = london)
resid = invBoxCox(fitted(bc_model), lambda) - london$realSum
rmse = sqrt(sum(resid^2)/nrow(london))
r2 = summary(bc_model)$r.squared
coefs = nrow(summary(bc_model)$coeff)
row = data.frame(model = "bc_model", rmse = rmse, r2 = r2, coefficients = coefs )
row
## model rmse r2 coefficients
## 1 bc_model 93.59655 0.6895857 22
rbind(transform_results, row)
## model rmse r2 coefficients
## 1 full_log_model 93.69018 0.6889451 22
## 2 bc_model 93.59655 0.6895857 22
The Box-Cox transform did not help significantly.
We will now try transformations of other variables. Distance may have a skewed distribution, with proximity to the city center mattering most.
hist(log(london$dist),
main = "Distribution of dist",
xlab = "dist", col = "steelblue",
prob = TRUE)

hist(sqrt(london$dist),
     main = "Distribution of sqrt(dist)",
     xlab = "sqrt(dist)", col = "steelblue", prob = TRUE)
curve(dnorm(x, mean = mean(sqrt(london$dist)), sd = sd(sqrt(london$dist))),
      col = "darkorange", add = TRUE, lwd = 3)

The transform of distance may allow for a better model. We will try a
number of transformations.
Interaction Analysis
Next we fit a model with interactions and then reduce the number of variables with backward AIC. We will start with the full log model.
Fit a full interaction model
full_interact = lm(log(realSum) ~ .^2, data = london)
interact_results = log_diag(full_interact, "full_interact")
interact_results
## model rmse r2 coefficients
## 1 full_interact 87.85819 0.7242581 187
The full interaction model has a higher \(R^2\) value than any of the transformation models above, though we will have to check for overfitting. Its \(RMSE\) is also significantly lower than the transformation models', by about 5 units.
The full interaction model is quite large. Let us use AIC and BIC to
reduce the number of predictors.
AIC decreases the number of coefficients from 214 in the full interaction model to 113, with a decrease in \(R^2\) from 0.7242581 to 0.7199981 and an increase in \(RMSE\) from 87.85819 to 88.8297, a reduction of roughly 100 predictors.
Try BIC to get a smaller model
BIC decreases the number of coefficients from 214 in the full interaction model to 45, with a decrease in \(R^2\) from 0.7242581 to 0.705297 and an increase in \(RMSE\) from 87.85819 to 91.23548, a reduction on the order of 157 predictors.
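The two backward searches take a long time on the full interaction model, but the same pattern can be illustrated on a small built-in dataset (mtcars here is just a stand-in, not the report's data):

```r
# Backward elimination by AIC (default penalty k = 2) and by BIC (k = log(n))
fit = lm(mpg ~ .^2, data = mtcars[, c("mpg", "wt", "hp", "qsec")])
aic_fit = step(fit, direction = "backward", trace = FALSE)
bic_fit = step(fit, direction = "backward", k = log(nrow(mtcars)), trace = FALSE)
length(coef(aic_fit))
length(coef(bic_fit))   # BIC's heavier penalty tends to keep fewer terms
```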
Putting It Together
Try a model with transformations and interactions. Start with the
model from the BIC elimination above and add the sqrt(rest_index_norm)
transformation.
Fit the combined model
combined_model = lm(log(realSum) ~ room_type + person_capacity + multi +
biz + cleanliness_rating + guest_satisfaction_overall + bedrooms +
dist + metro_dist + attr_index_norm + rest_index_norm + lng +
lat + room_type:multi + room_type:cleanliness_rating + biz:lat +
guest_satisfaction_overall:rest_index_norm + dist:lng + dist:lat +
metro_dist:lng + metro_dist:lat + attr_index_norm:rest_index_norm + sqrt(rest_index_norm), data = london)
combined_results = log_diag(combined_model, "combined_model")
combined_results
## model rmse r2 coefficients
## 1 combined_model 91.09935 0.7068179 31
Adding sqrt(rest_index_norm) doesn't really change model performance.
Let us now remove predictors to simplify the model. I removed predictors exhibiting collinearity and those with high p-values.
smaller_combined_model = lm(log(realSum) ~ room_type + person_capacity + guest_satisfaction_overall +
                              bedrooms + dist + attr_index_norm + lng +
                              multi:guest_satisfaction_overall +
                              dist:rest_index_norm + metro_dist:attr_index_norm, data = london)
vif(smaller_combined_model)
## room_type person_capacity3
## 1.682144 1.103167
## person_capacity4 person_capacity5
## 1.615566 1.214803
## person_capacity6 guest_satisfaction_overall
## 1.745485 1.099646
## bedrooms1 bedrooms2
## 2.825979 3.121398
## bedrooms3 bedrooms4
## 1.453304 1.038902
## bedrooms5 dist
## 1.005199 5.766369
## attr_index_norm lng
## 5.577048 1.171724
## guest_satisfaction_overall:multi1 dist:rest_index_norm
## 1.051256 1.224161
## attr_index_norm:metro_dist
## 1.230701
row = log_diag(smaller_combined_model, "smaller_combined_model")
row
## model rmse r2 coefficients
## 1 smaller_combined_model 94.71747 0.6811681 18
combined_results = rbind(combined_results, row)
The smaller combined model has far fewer predictors, but its performance does suffer.
Overall comparison of models
all_results = rbind(linear_exp_results, interact_results)  # "sum" would mask base::sum
all_results = rbind(all_results, combined_results)
all_results = rbind(all_results, transform_results)
all_results
## model rmse r2 coefficients
## 1 full_additive_mod 93.16959 0.6757450 22
## 2 aic_full_additive_mod 93.19119 0.6755946 20
## 3 bic_full_additive_model 93.50005 0.6734407 16
## 4 brute_force 95.70173 0.6578804 16
## 5 full_interact 87.85819 0.7242581 187
## 6 combined_model 91.09935 0.7068179 31
## 7 smaller_combined_model 94.71747 0.6811681 18
## 8 full_log_model 93.69018 0.6889451 22
## 9 dist^2 93.65850 0.6892697 23
## 10 dist^3 93.68354 0.6890521 23
## 11 sqrt(dist) 93.58824 0.6898447 23
## 12 log(dist) 93.56241 0.6900572 23
## 13 1/dist 93.52151 0.6903879 23
## 14 poly 4 93.40063 0.6911002 25
## 15 metro_dist^2 93.68063 0.6889889 23
## 16 metro_dist^3 93.67733 0.6890096 23
## 17 sqrt(metro_dist) 93.68713 0.6889615 23
## 18 log(metro_dist) 93.68867 0.6889534 23
## 19 1/metro_dist 93.69047 0.6889453 23
## 20 cleanliness_rating^2 93.69926 0.6896434 23
## 21 cleanliness_rating^3 93.69926 0.6896434 23
## 22 sqrt(cleanliness_rating) 93.69926 0.6896434 23
## 23 log(cleanliness_rating) 93.69926 0.6896434 23
## 24 1/cleanliness_rating 93.69926 0.6896434 23
## 25 guest_satisfaction_overall^2 93.42177 0.6912219 23
## 26 guest_satisfaction_overall^3 93.42298 0.6912350 23
## 27 sqrt(guest_satisfaction_overall) 93.42103 0.6911926 23
## 28 log(guest_satisfaction_overall) 93.42111 0.6911799 23
## 29 1/guest_satisfaction_overall 93.42180 0.6911502 23
## 30 attr_index_norm^2 93.26696 0.6909701 23
## 31 attr_index_norm^3 93.33142 0.6905552 23
## 32 sqrt(attr_index_norm) 93.19488 0.6915052 23
## 33 log(attr_index_norm) 93.19333 0.6915609 23
## 34 1/attr_index_norm 93.24204 0.6913636 23
## 35 rest_index_norm^2 93.10515 0.6934284 23
## 36 rest_index_norm^3 93.16684 0.6929168 23
## 37 sqrt(rest_index_norm) 93.07934 0.6937721 23
## 38 log(rest_index_norm) 93.10136 0.6936802 23
## 39 1/rest_index_norm 93.19633 0.6931338 23
## 40 lat^2 93.45178 0.6896209 23
## 41 lat^3 93.45172 0.6896211 23
## 42 sqrt(lat) 93.69018 0.6889451 22
## 43 log(lat) 93.69018 0.6889451 22
## 44 1/lat 93.45194 0.6896202 23
## 45 lng^2 93.48710 0.6894534 23
## 46 lng^3 93.65829 0.6889698 23
## 47 sqrt(lng) 226.11375 0.9272024 21
## 48 log(lng) 226.19698 0.9256809 21
## 49 1/lng 93.67125 0.6892571 23
Note that sqrt(lng) and log(lng) are undefined for the negative longitudes in the data, so those two rows are fit only on the rows with positive longitude and are not comparable to the other models.
Assumption Analysis
Function to perform diagnostic tests of LINE assumptions
assumpt = function(model, pcol = "gray", lcol = "dodgerblue") {
  par(mfrow = c(1, 2))
  plot(fitted(model), resid(model), col = pcol, pch = 20,
       xlab = "Fitted", ylab = "Residuals",
       main = paste("Fitted vs. Residuals for ", substitute(model), sep = ""))
  abline(h = 0, col = lcol, lwd = 2)
  qqnorm(resid(model), main = paste("Normal Q-Q Plot for ", substitute(model), sep = ""), col = pcol, pch = 20)
  qqline(resid(model), col = lcol, lwd = 2)
}
assumpt(full_additive_mod)

assumpt(full_interact)

assumpt(combined_model)

assumpt(smaller_combined_model)

assumpt(brute_force)

table = data.frame()
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
tab = data.frame(Model = "full_additive_mod", Shapiro = unname(shapiro.test(resid(full_additive_mod))$p.value), BP = unname(bptest(full_additive_mod)$p.value))
row = data.frame(Model = "full_interact", Shapiro = unname(shapiro.test(resid(full_interact))$p.value), BP = unname(bptest(full_interact)$p.value))
tab = rbind(tab, row)
row = data.frame(Model = "combined_model", Shapiro = unname(shapiro.test(resid(combined_model))$p.value), BP = unname(bptest(combined_model)$p.value))
tab = rbind(tab, row)
row = data.frame(Model = "smaller_combined_model", Shapiro = unname(shapiro.test(resid(smaller_combined_model))$p.value),
BP = unname(bptest(smaller_combined_model)$p.value))
tab = rbind(tab, row)
row = data.frame(Model = "brute_force", Shapiro = unname(shapiro.test(resid(brute_force))$p.value), BP = unname(bptest(brute_force)$p.value))
tab = rbind(tab, row)
kable(tab)
|Model                  | Shapiro|      BP|
|:----------------------|-------:|-------:|
|full_additive_mod      |       0| 0.0e+00|
|full_interact          |       0| 3.8e-06|
|combined_model         |       0| 0.0e+00|
|smaller_combined_model |       0| 0.0e+00|
|brute_force            |       0| 0.0e+00|
Removal of Unusual Values
In the above analysis, balancing fit against model size, the smaller_combined_model looks like the most practical choice. We will now eliminate unusual values.
plot(smaller_combined_model)
## Warning: not plotting observations with leverage one:
## 161
(Figures: the four standard lm diagnostic plots for smaller_combined_model.)
unusual_index = cooks.distance(smaller_combined_model)>(4/nrow(london))
london_no_unusual = london[!unusual_index,]
london_no_unusual = na.omit(london_no_unusual)
nrow(london) - nrow(london_no_unusual)
## [1] 173
(nrow(london) - nrow(london_no_unusual))/nrow(london)
## [1] 0.04388635
max(london_no_unusual$realSum)
## [1] 833.0393
Removing points with a Cook's distance > 4/n eliminates 173 points, or about 4.4% of the total number of rows.
library(lmtest)
smaller_combined_no_unusual = lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating +
                                   guest_satisfaction_overall + bedrooms + dist + attr_index_norm + lng +
                                   multi:guest_satisfaction_overall + biz:cleanliness_rating +
                                   cleanliness_rating:guest_satisfaction_overall +
                                   dist:rest_index_norm + metro_dist:attr_index_norm, data = london_no_unusual)
log_diag(smaller_combined_no_unusual, "smaller_combined_no_unusual")
## Warning in exp(fitted(model)) - london$realSum: longer object length is not a
## multiple of shorter object length
## model rmse r2 coefficients
## 1 smaller_combined_no_unusual 211.3619 0.7533451 20
(The warning appears because log_diag still compares fitted values against the full london data rather than london_no_unusual, so this RMSE is not directly comparable to the earlier values.)
assumpt(smaller_combined_no_unusual)

plot(smaller_combined_no_unusual)
## Warning: not plotting observations with leverage one:
## 1241


shapiro.test(resid(smaller_combined_no_unusual)[1:4980])  # shapiro.test accepts at most 5000 observations
##
## Shapiro-Wilk normality test
##
## data: resid(smaller_combined_no_unusual)[1:4980]
## W = 0.99341, p-value = 4.023e-12
bptest(smaller_combined_no_unusual)
##
## studentized Breusch-Pagan test
##
## data: smaller_combined_no_unusual
## BP = 127.22, df = 19, p-value < 2.2e-16
What if we exclude points with large leverage?
leverage_index = hatvalues(smaller_combined_no_unusual)>(2*mean(hatvalues(smaller_combined_no_unusual)))
sum(leverage_index)
## [1] 210
fewer = london_no_unusual[!leverage_index, ]  # leverage_index was computed on london_no_unusual
smaller_b = lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating +
                 guest_satisfaction_overall + bedrooms + dist + attr_index_norm + lng +
                 multi:guest_satisfaction_overall + biz:cleanliness_rating +
                 cleanliness_rating:guest_satisfaction_overall +
                 dist:rest_index_norm + metro_dist:attr_index_norm, data = fewer)
log_diag(smaller_b, "smaller_combined_no_unusual_b")
## Warning in exp(fitted(model)) - london$realSum: longer object length is not a
## multiple of shorter object length
## model rmse r2 coefficients
## 1 smaller_combined_no_unusual_b 209.8111 0.6897667 21
assumpt(smaller_b)

plot(smaller_b)
## Warning: not plotting observations with leverage one:
## 154


shapiro.test(resid(smaller_b)[1:4980])
##
## Shapiro-Wilk normality test
##
## data: resid(smaller_b)[1:4980]
## W = 0.97857, p-value < 2.2e-16
library(MASS)
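The fit below uses lambda and invBoxCox(), both defined earlier in the analysis. A minimal sketch of those definitions (the search grid is an assumption):

```r
# lambda: maximizer of the Box-Cox profile log-likelihood (MASS::boxcox)
bc = boxcox(smaller_combined_model, lambda = seq(-1, 1, 0.01), plotit = FALSE)
lambda = bc$x[which.max(bc$y)]
# invBoxCox: inverse of the Box-Cox transform y' = (y^lambda - 1) / lambda
invBoxCox = function(y, lambda) (lambda * y + 1)^(1 / lambda)
```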
smaller_combined_bc_nu = lm(((realSum^lambda - 1) / lambda) ~ room_type + person_capacity + cleanliness_rating +
    guest_satisfaction_overall + bedrooms + dist + attr_index_norm + lng +
    multi:guest_satisfaction_overall + biz:cleanliness_rating +
    cleanliness_rating:guest_satisfaction_overall +
    dist:rest_index_norm + metro_dist:attr_index_norm, data = london_no_unusual)
assumpt(smaller_combined_bc_nu)

bc_response = (london$realSum^lambda - 1) / lambda
hist(bc_response, prob = TRUE)
curve(dnorm(x, mean = mean(bc_response), sd = sd(bc_response)),
      col = "darkorange", add = TRUE, lwd = 3)
residuals = unname(resid(smaller_combined_bc_nu))[1:4998]
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.99299, p-value = 1.285e-12
resid = invBoxCox(fitted(smaller_combined_bc_nu), lambda) - london_no_unusual$realSum
rmse = sqrt(sum(resid^2) / nrow(london_no_unusual))
r2 = summary(smaller_combined_bc_nu)$r.squared
coefs = nrow(summary(smaller_combined_bc_nu)$coeff)
data.frame(model = "smaller_combined_bc_nu", rmse = rmse, r2 = r2, coefficients = coefs )
## model rmse r2 coefficients
## 1 smaller_combined_bc_nu 211.2203 0.7542762 20

Model Evaluation
sample = sample(c(TRUE, FALSE), nrow(london_no_unusual), replace=TRUE, prob=c(0.7,0.3))
london_train = london_no_unusual[sample, ]
london_test = london_no_unusual[!sample, ]
log_predict_diag = function(true_data, fit_data, model, dataset) {
  resid = exp(unname(fit_data)) - unname(true_data)
  rmse = sqrt(sum(resid^2) / length(fit_data))
  data.frame(model = model, dataset = dataset, rmse = rmse)
}
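This helper back-transforms the log-scale predictions and reports RMSE on the original price scale:

\[
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( e^{\hat{y}_i} - y_i \right)^2 }
\]

where \(\hat{y}_i\) is the predicted log price and \(y_i\) the observed realSum.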
smaller_combined_no_unusual_train = lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating +
    guest_satisfaction_overall + bedrooms + dist + attr_index_norm + lng +
    multi:guest_satisfaction_overall + biz:cleanliness_rating +
    cleanliness_rating:guest_satisfaction_overall +
    dist:rest_index_norm + metro_dist:attr_index_norm, data = london_train)
test_fit = predict(smaller_combined_no_unusual_train, london_test)
eval_results = log_predict_diag(london_test$realSum, test_fit, "smaller_combined_no_unusual_test", "test")
eval_results
## model dataset rmse
## 1 smaller_combined_no_unusual_test test 78.23421
train_fit = predict(smaller_combined_no_unusual_train, london_train)
row = log_predict_diag(london_train$realSum, train_fit, "smaller_combined_no_unusual_train", "training")
row
## model dataset rmse
## 1 smaller_combined_no_unusual_train training 77.69433
eval_results = rbind(eval_results, row)
plot(exp(unname(test_fit)) ~ unname(london_test$realSum))
abline(0, 1)

full_additive_no_unusual_train = lm(log(realSum) ~ ., data = london_train)
test_fit = predict(full_additive_no_unusual_train, london_test)
row = log_predict_diag(london_test$realSum, test_fit, "full_additive_no_unusual_test", "test")
row
## model dataset rmse
## 1 full_additive_no_unusual_test test 78.19106
eval_results = rbind(eval_results, row)
train_fit = predict(full_additive_no_unusual_train, london_train)
row = log_predict_diag(london_train$realSum, train_fit, "full_additive_no_unusual_train", "training")
row
## model dataset rmse
## 1 full_additive_no_unusual_train training 77.96598
eval_results = rbind(eval_results, row)
full_interact_no_unusual_train = lm(log(realSum) ~ .^2, data = london_train)
test_fit = predict(full_interact_no_unusual_train, london_test)
## Warning in predict.lm(full_interact_no_unusual_train, london_test): prediction
## from rank-deficient fit; attr(*, "non-estim") has doubtful cases
row = log_predict_diag(london_test$realSum, test_fit, "full_interact_no_unusual_test", "test")
row
## model dataset rmse
## 1 full_interact_no_unusual_test test 79.6976
eval_results = rbind(eval_results, row)
train_fit = predict(full_interact_no_unusual_train, london_train)
## Warning in predict.lm(full_interact_no_unusual_train, london_train): prediction
## from rank-deficient fit; attr(*, "non-estim") has doubtful cases
row = log_predict_diag(london_train$realSum, train_fit, "full_interact_no_unusual_train", "training")
row
## model dataset rmse
## 1 full_interact_no_unusual_train training 72.49799
eval_results = rbind(eval_results, row)
kable(eval_results)
| model                             | dataset  |     rmse |
|:----------------------------------|:---------|---------:|
| smaller_combined_no_unusual_test  | test     | 78.23421 |
| smaller_combined_no_unusual_train | training | 77.69433 |
| full_additive_no_unusual_test     | test     | 78.19106 |
| full_additive_no_unusual_train    | training | 77.96598 |
| full_interact_no_unusual_test     | test     | 79.69760 |
| full_interact_no_unusual_train    | training | 72.49799 |
As we can see from the predicted-versus-actual plot for
smaller_combined_no_unusual_train, the model tends to underestimate the
true prices of the most expensive listings. There may be some intangible
driver of these listings' prices that the predictors cannot capture.
What happens if we exclude listings with realSum values above 1000?
london_no_unusual_no_outliers = london_no_unusual[london_no_unusual$realSum <= 1000, ]
hA = hist(log(london_no_unusual_no_outliers$realSum), plot = FALSE)
hB = hist(log(london_no_unusual$realSum), plot = FALSE)
plot(hB, col = "dodgerblue")
plot(hA, col = "darkorange", add = TRUE)

smaller_combined_no_unusual_no_outliers_mod = lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating +
    guest_satisfaction_overall + bedrooms + dist + log(dist) + attr_index_norm + lng +
    multi:guest_satisfaction_overall + biz:cleanliness_rating +
    cleanliness_rating:guest_satisfaction_overall +
    dist:rest_index_norm + metro_dist:attr_index_norm, data = london_no_unusual_no_outliers)
plot(smaller_combined_no_unusual_no_outliers_mod)
## Warning: not plotting observations with leverage one:
## 1241




fit = predict(smaller_combined_no_unusual_no_outliers_mod, london_no_unusual_no_outliers)
plot(exp(unname(fit)) ~ unname(london_no_unusual_no_outliers$realSum))
abline(0, 1)

resid = exp(fitted(smaller_combined_no_unusual_no_outliers_mod)) - london_no_unusual_no_outliers$realSum
rmse = sqrt(sum(resid^2)/nrow(london_no_unusual_no_outliers))
r2 = summary(smaller_combined_no_unusual_no_outliers_mod)$r.squared
coefs = nrow(summary(smaller_combined_no_unusual_no_outliers_mod)$coeff)
data.frame(model = "smaller_combined_no_unusual_no_outliers_mod", rmse = rmse, r2 = r2, coefficients = coefs )
## model rmse r2 coefficients
## 1 smaller_combined_no_unusual_no_outliers_mod 77.64277 0.7545 21
shapiro.test(resid(smaller_combined_no_unusual_no_outliers_mod)[1:4999])
##
## Shapiro-Wilk normality test
##
## data: resid(smaller_combined_no_unusual_no_outliers_mod)[1:4999]
## W = 0.99339, p-value = 3.849e-12
bptest(smaller_combined_no_unusual_no_outliers_mod)
##
## studentized Breusch-Pagan test
##
## data: smaller_combined_no_unusual_no_outliers_mod
## BP = 127.67, df = 20, p-value < 2.2e-16
nrow(london_no_unusual_no_outliers)
## [1] 3769
Results
mean(london$realSum)
## [1] 309.1503
The simple fully additive model, full_additive_mod, produced a high RMSE
of 372.2035 and an \(R^2\) of 0.2768903 when predicting Airbnb listing
prices. This is quite poor, given that the mean realSum is 309.1503. We
tried a number of methods to improve this model, including removing
unusual values and model selection via AIC and BIC, without much
improvement. Treating the integer predictors as factors rather than as
numeric variables did not help either. Looking at the pairs plot, the
relationships between many of the variables appear nonlinear, indicating
that adding predictor interactions and transformations may be helpful.
By far the most significant improvement in our modelling came from the
log transform of realSum. The resulting model, full_log_model, had a
much improved \(R^2\) of 0.6877577, though the RMSE remained high at
371.2015.
plot(fitted(full_additive_mod), london$realSum)
abline(0, 1)

plot(exp(fitted(full_log_model)), london$realSum)
abline(0, 1)

As the plots above show, both full_additive_mod and full_log_model still
leave a large number of outliers poorly fit.
We tried a number of other transformations, none of which had much
impact on the apparent fit of the model. Next, we tried the full
interaction model. This improved the \(R^2\) to 0.7335292, but the RMSE
was about the same. This model also showed evidence of overfitting: its
RMSE rose significantly when used for prediction on held-out data. In
the end, we used BIC to reduce the full interaction model to a more
tractable model, smaller_combined_model, which was the mainstay of the
rest of our analysis.
Our suspicion grew that unusual observations were inflating the RMSE, as
can be seen in the diagnostic plots for smaller_combined_model:
smaller_combined_model = lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating +
    guest_satisfaction_overall + bedrooms + dist + attr_index_norm + lng +
    multi:guest_satisfaction_overall + biz:cleanliness_rating +
    cleanliness_rating:guest_satisfaction_overall +
    dist:rest_index_norm + metro_dist:attr_index_norm, data = london)
plot(smaller_combined_model)
## Warning: not plotting observations with leverage one:
## 161




4/nrow(london)
## [1] 0.001014713
In the end, we eliminated the 173 rows of data with a Cook's distance
above \(4/n\). This improved model fit significantly, and RMSE values
decreased substantially, to below 80 on the held-out test set.
smaller_combined_no_unusual_train = lm(log(realSum) ~ room_type + person_capacity + cleanliness_rating +
    guest_satisfaction_overall + bedrooms + dist + attr_index_norm + lng +
    multi:guest_satisfaction_overall + biz:cleanliness_rating +
    cleanliness_rating:guest_satisfaction_overall +
    dist:rest_index_norm + metro_dist:attr_index_norm, data = london_train)
test_fit = predict(smaller_combined_no_unusual_train, london_test)
plot(exp(unname(test_fit)) ~ unname(london_test$realSum))
abline(0, 1)

One difficulty we had was assuring ourselves that our model did not
violate the assumptions of constant variance and normality. Our best
model, smaller_combined_no_unusual_train, did have a fairly good Q-Q
plot and fitted-versus-residuals plot.
assumpt(smaller_combined_no_unusual)

This model did not pass the Shapiro-Wilk test, but that test is
extremely sensitive at large sample sizes: even trivial departures from
normality will lead it to reject the null hypothesis of normally
distributed residuals.
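A quick simulated illustration of this sensitivity (synthetic data, not our dataset): the same mild departure from normality that goes unnoticed in a small sample is flagged once the sample is large.

```r
set.seed(1)
x = rt(5000, df = 10)   # close to normal, but slightly heavy-tailed
shapiro.test(x[1:100])  # small sample: the departure usually goes undetected
shapiro.test(x)         # at the 5000-observation cap: usually rejected
```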
We also tried a Box-Cox transformation, which did not help much over the
log transformation of the response variable.
In the end, our model with predictors room_type + person_capacity +
cleanliness_rating + guest_satisfaction_overall + bedrooms + dist +
attr_index_norm + lng + multi:guest_satisfaction_overall +
biz:cleanliness_rating + cleanliness_rating:guest_satisfaction_overall +
dist:rest_index_norm + metro_dist:attr_index_norm seemed optimal in
terms of coming close to satisfying the LINE assumptions, providing low
RMSE, and remaining interpretable. The main predictors, including
room_type, person_capacity, cleanliness_rating,
guest_satisfaction_overall, and bedrooms, are factors one would expect
to influence the cost of a room. Likewise, proximity to the city centre
and to attractions makes a room more desirable. The interactions make
sense as well, implying synergies between predictors such as distance to
the metro and the attraction index.